Several pods do not start, encounter "too many open files" error · Issue #2087 · kubeflow/manifests · GitHub

While setting up a Kubeflow cluster from the master branch at commit 3dad839f, four pods fail with a "too many open files" error.

For the k8s cluster, I'm using a local k3d cluster on MacOS (11.6.1): https://k3d.io

After deploying Kubeflow, this is the status of the four failing pods:

kubectl get pod -A | grep -v Run | grep -v NAME
kubeflow   ml-pipeline-8c4b99589-gcvmz          1/2   CrashLoopBackOff   15   63m
kubeflow   kfserving-controller-manager-0       1/2   CrashLoopBackOff   15   63m
kubeflow   profiles-deployment-89f7d88b-hp697   1/2   CrashLoopBackOff   15   63m
kubeflow   katib-controller-68c47fbf8b-d6mpj    0/1   CrashLoopBackOff   16   63m
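The `grep -v Run` filter above drops any line containing "Run" anywhere, including in a pod name. A slightly stricter variant, as a minimal sketch, matches on the STATUS column only. The function name `not_running` is made up for this example, and it assumes the default `kubectl get pod -A` column order (NAMESPACE NAME READY STATUS RESTARTS AGE).

```shell
#!/bin/sh
# Sketch: list pods whose STATUS column is neither Running nor Completed.
# Reads `kubectl get pod -A` output on stdin; the header row is skipped.
# `not_running` is a hypothetical helper name, not part of kubectl.
not_running() {
  awk '$1 != "NAMESPACE" && $4 != "Running" && $4 != "Completed" { print $1 "/" $2, $4 }'
}

# Usage (needs a reachable cluster):
#   kubectl get pod -A | not_running
```

Against the four pods listed above this would print one `namespace/name status` pair per line, for example `kubeflow/katib-controller-68c47fbf8b-d6mpj CrashLoopBackOff`.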

The cluster has been torn down and rebuilt several times. Each time the same 4 pods encounter the too many open files error. All other pods successfully attain Running status.

According to ulimit -n on the nodes, the open-files limit there is already very high: 1048576. Since the cluster runs on macOS, I also used launchctl to raise maxfiles from 256 to 524288.
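One thing that may be worth ruling out: on Linux, "too many open files" (EMFILE) is also returned when the per-user inotify instance limit is exhausted, and that limit is separate from `ulimit -n`. Nothing in this report confirms that this is the cause here, so treat the following as a diagnostic sketch only. The function name `report_limits` is made up for this example.

```shell
#!/bin/sh
# Sketch: print the limits that can lead to "too many open files".
# nofile is the per-process descriptor limit; the inotify values are
# per-user kernel limits read from /proc (Linux only).
report_limits() {
  echo "nofile_soft=$(ulimit -n)"
  for key in max_user_instances max_user_watches; do
    f="/proc/sys/fs/inotify/$key"
    if [ -r "$f" ]; then
      echo "inotify_$key=$(cat "$f")"
    else
      # Not a Linux host, or /proc is not mounted.
      echo "inotify_$key=unavailable"
    fi
  done
}

report_limits
```

To check a k3d node rather than the Mac itself, the same reads can be run inside the node container, for example `docker exec k3d-kubeflow-agent-0 cat /proc/sys/fs/inotify/max_user_instances`, assuming the node containers carry the node names shown by `kubectl get node`.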

I'm new to kubeflow, so any guidance offered will be appreciated.

The diagnostic data I collected follows:

Log extract from failed pods

kubectl logs ml-pipeline-8c4b99589-gcvmz
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/ml-pipeline-8c4b99589-gcvmz. Please use `kubectl.kubernetes.io/default-container` instead
2021/12/11 13:01:59 too many open files

kubectl logs kfserving-controller-manager-0 -c manager
>
{"level":"error","ts":1639227716.1910038,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.InferenceService Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1911373,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1alpha1.TrainedModel Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:547"}
{"level":"error","ts":1639227716.1912212,"logger":"entrypoint","msg":"unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/kfserving/cmd/manager/main.go:183\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

kubectl logs profiles-deployment-89f7d88b-hp697 -c manager
I1211 13:02:40.188855 1 request.go:645] Throttling request took 1.036224909s, request: GET:https://10.43.0.1:443/apis/flows.knative.dev/v1?timeout=32s
2021-12-11T13:02:41.646Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": ":8080"}
2021-12-11T13:02:41.646Z ERROR setup unable to create controller {"controller": "Profile", "error": "Failed to start file watcher: too many open files", "errorVerbose": "too many open files\nFailed to start file watcher\ngithub.com/kubeflow/kubeflow/components/profile-controller/controllers.(*ProfileReconciler).SetupWithManager\n\t/workspace/controllers/profile_controller.go:381\nmain.main\n\t/workspace/main.go:93\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:204\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1374"}
runtime.main
/usr/local/go/src/runtime/proc.go:204

kubectl logs katib-controller-68c47fbf8b-d6mpj
>>>>>
{"level":"error","ts":1639227826.322595,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Suggestion Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.3227415,"logger":"controller-runtime.manager","msg":"error received after stop sequence was engaged","error":"Timeout: failed waiting for *v1beta1.Experiment Informer to sync","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nsigs.k8s.io/controller-runtime/pkg/manager.(*controllerManager).engageStopProcedure.func1\n\t/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/manager/internal.go:529"}
{"level":"error","ts":1639227826.32281,"logger":"entrypoint","msg":"Unable to run the manager","error":"too many open files","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/[email protected]/zapr.go:132\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1beta1/main.go:128\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:255"}
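Since the stack traces make these logs hard to scan, a short filter that pulls out only the `"error"` fields and counts them can help compare pods. This is a minimal sketch using grep and sed rather than a JSON parser, so it assumes the error text itself contains no double quotes (true for the lines above). The function name `extract_errors` is made up for this example.

```shell
#!/bin/sh
# Sketch: read controller logs on stdin and print each distinct "error"
# value with its count, most frequent first.
# Handles both `"error":"..."` and `"error": "..."` spacing.
extract_errors() {
  grep -o '"error": *"[^"]*"' \
    | sed 's/^"error": *"//; s/"$//' \
    | sort | uniq -c | sort -rn
}

# Usage (needs a reachable cluster):
#   kubectl logs katib-controller-68c47fbf8b-d6mpj | extract_errors
```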

Kubeflow was deployed by running kustomize build ${component} | kubectl apply -f - on each of the following components, in the order shown:

# cert manager
common/cert-manager/cert-manager/base \
common/cert-manager/kubeflow-issuer/base \
# istio
common/istio-1-9/istio-crds/base \
common/istio-1-9/istio-namespace/base \
common/istio-1-9/istio-install/base \
#DEX
common/dex/overlays/istio \
# OIDC Auth Service
common/oidc-authservice/base \
# knative serving
common/knative/knative-serving/base \
common/istio-1-9/cluster-local-gateway/base \
# inference event logging
common/knative/knative-eventing/base \
# kubeflow namespace
common/kubeflow-namespace/base \
# kubeflow roles
common/kubeflow-roles/base \
# kubeflow istio resources
common/istio-1-9/kubeflow-istio-resources/base \
# kubeflow pipelines
apps/pipeline/upstream/env/platform-agnostic-multi-user-pns \
# KFServing
apps/kfserving/upstream/overlays/kubeflow \
# Katib
apps/katib/upstream/installs/katib-with-kubeflow \
# Central Dashboard
apps/centraldashboard/upstream/overlays/istio \
# Admission Controller
apps/admission-webhook/upstream/overlays/cert-manager \
# Notebooks
apps/jupyter/notebook-controller/upstream/overlays/kubeflow \
# Jupyter web app
apps/jupyter/jupyter-web-app/upstream/overlays/istio \
# Profiles + KFAM
apps/profiles/upstream/overlays/kubeflow \
# Volumes Web app
apps/volumes-web-app/upstream/overlays/istio \
# Tensorboard
apps/tensorboard/tensorboards-web-app/upstream/overlays/istio \
# Training Operator
apps/training-operator/upstream/overlays/kubeflow \
# User Namespace
common/user-namespace/base \

Platform
MacOS: 11.6.1
MacBookPro 2019 (Intel), 16GB RAM

Software Versions:

k3d version
k3d version v5.1.0
k3s version v1.21.5-k3s2 (default)

docker version
Client:
 Cloud integration: v1.0.22
 Version: 20.10.11
 API version: 1.41
 Go version: go1.16.10
 Git commit: dea9396
 Built: Thu Nov 18 00:36:09 2021
 OS/Arch: darwin/amd64
 Context: default
 Experimental: true
Server: Docker Engine - Community
 Engine:
  Version: 20.10.11
  API version: 1.41 (minimum version 1.12)
  Go version: go1.16.9
  Git commit: 847da18
  Built: Thu Nov 18 00:35:39 2021
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: 1.4.12
  GitCommit: 7b11cfaabd73bb80907dd23182b9347b4245eb5d
 runc:
  Version: 1.0.2
  GitCommit: v1.0.2-0-g52b36a2
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.5+k3s2", GitCommit:"724ef700bab896ff252a75e2be996d5f4ff1b842", GitTreeState:"clean", BuildDate:"2021-10-05T19:59:14Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}

kustomize version
Version: {KustomizeVersion:3.2.0 GitCommit:a3103f1e62ddb5b696daa3fd359bb6f2e8333b49 BuildDate:2019-09-18T16:26:36Z GoOs:darwin GoArch:amd64}

k3d cluster nodes

kubectl get node -o wide
NAME                    STATUS   ROLES                  AGE   VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE   KERNEL-VERSION     CONTAINER-RUNTIME
k3d-kubeflow-server-0   Ready    control-plane,master   79m   v1.21.5+k3s2   172.19.0.2    <none>        Unknown    5.10.76-linuxkit   containerd://1.4.11-k3s1
k3d-kubeflow-agent-0    Ready    <none>                 79m   v1.21.5+k3s2   172.19.0.3    <none>        Unknown    5.10.76-linuxkit   containerd://1.4.11-k3s1

ulimit for the two nodes

ulimit -a # on server node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited

ulimit -a # on worker node
core file size (blocks) (-c) 0
data seg size (kb) (-d) unlimited
scheduling priority (-e) 0
file size (blocks) (-f) unlimited
pending signals (-i) 51481
max locked memory (kb) (-l) 64
max memory size (kb) (-m) unlimited
open files (-n) 1048576
POSIX message queues (bytes) (-q) 819200
real-time priority (-r) 0
stack size (kb) (-s) 8192
cpu time (seconds) (-t) unlimited
max user processes (-u) unlimited
virtual memory (kb) (-v) unlimited
file locks (-x) unlimited
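For anyone trying to reproduce the deployment step described earlier, the per-component `kustomize build ${component} | kubectl apply -f -` command can be driven from a plain list of paths. This is a sketch of one way to do it, not the exact script used for this report. The function name `deploy_all` and the `DRY_RUN` switch are made up for this example.

```shell
#!/bin/sh
# Sketch: apply kustomize components in order, one path per line on stdin.
# Blank lines and lines starting with '#' are skipped.
# DRY_RUN=1 (hypothetical switch) prints the commands instead of running them.
deploy_all() {
  while IFS= read -r component; do
    case "$component" in
      ''|'#'*) continue ;;
    esac
    if [ "${DRY_RUN:-0}" = 1 ]; then
      echo "kustomize build $component | kubectl apply -f -"
    else
      # </dev/null keeps kustomize from consuming the list on stdin.
      kustomize build "$component" </dev/null | kubectl apply -f - || return 1
    fi
  done
}

# Usage (components.txt is a hypothetical file with one path per line):
#   DRY_RUN=1 deploy_all < components.txt
```

Stopping at the first failed apply keeps the order meaningful, since later components depend on namespaces and CRDs created by earlier ones.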

